

Filtering

CiDaemon Process
Filter DLLs
Associating File Types with Extensions
Word-Breaker DLLs
Noise Words
CiDaemon Priority Settings
Related Performance Counters
Disk Full Condition


Microsoft Index Server filters documents by inserting data from the document files into content indexes. Content filters break documents into words (keys) and create word lists, which supply raw data for the index. Filtering is a three-step process:

  1. A filter DLL (dynamic link library) extracts the text and properties out of a document.
  2. A word-breaker DLL parses the text and textual properties into words.
  3. Noise words (also known as Stop Words) are removed from the data extracted from the document and the remaining words are stored in the index.
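
The three steps above can be sketched in Python. This is only an illustration of the data flow: the function names, the regular-expression word breaker, and the small noise-word set are assumptions made for the example, not the behavior of the actual filter DLLs, word-breaker DLLs, or noise word files.

```python
import re

# Hypothetical noise-word set; a real installation reads it from a noise word file.
NOISE_WORDS = {"a", "an", "and", "the", "of", "is"}

def extract_text(document: str) -> str:
    """Step 1 stand-in: a real filter DLL understands the document format."""
    return document

def break_words(text: str) -> list[str]:
    """Step 2 stand-in: split the text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_noise(words: list[str]) -> list[str]:
    """Step 3: drop noise words; the remaining words would go into the index."""
    return [w for w in words if w not in NOISE_WORDS]

def filter_document(document: str) -> list[str]:
    return remove_noise(break_words(extract_text(document)))

print(filter_document("The index is a list of words"))  # ['index', 'list', 'words']
```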

CiDaemon Process

The CiDaemon process is a child process created by the Microsoft Index Server engine. The engine passes a list of documents to the CiDaemon process, which filters each document by identifying the filter DLL and word-breaker DLL associated with it.

Filtering is done as a background activity so as not to interfere with any foreground activity. On local drives, if a document opened by the CiDaemon process for reading is needed by another process for writing, the CiDaemon process closes the document as soon as possible and retries filtering it later. (This feature is not available on network shares.)

If the CiDaemon process stops, it will be automatically restarted by the Index Server engine.


Filter DLLs

A filter DLL “understands” one or more document formats and is capable of extracting text and properties from documents of those types. A filter DLL implements the IFilter ActiveX interface, which the CiDaemon process uses to extract the text from a document. To track down a problem with a filter DLL, an administrator needs to know how to find the filter DLL registered for a particular document type. Editing the registry is also a good way to avoid filtering documents with no useful content.


Caution   Editing the registry incorrectly can cause serious problems, including corruption that may make it necessary to reinstall Windows NT or Microsoft Index Server. Using the Registry Editor to edit entries in the registry is equivalent to editing raw sectors on a hard disk. If you make mistakes, your computer’s configuration could be damaged. You should edit registry entries only for settings that you cannot adjust through the user interface, and be very careful whenever you edit the registry directly.


Document types and their associated filter DLL entries are specified in the registry under the \HKEY_LOCAL_MACHINE\Software\Classes tree. To find the filter DLL associated with a particular document type, navigate through the entries in that tree.

The following four steps find the filter DLL for a document type. The example is for HTML files.

Step 1: Determine the CLSID

Find the CLSID associated with the document type under the registry key \HKEY_LOCAL_MACHINE\SOFTWARE\Classes. Let this be <Value1>.

\HKEY_LOCAL_MACHINE\SOFTWARE\Classes
    htmlfile
        = Class for WWW HTML files
        CLSID
            = {25336920-03F9-11CF-8FD0-00AA00686F13}

Step 2: Determine the Persistent Handler

Using <Value1> from Step 1, find the PersistentHandler value for the \HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\<Value1> key. Let this be <Value2>.

\HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID
        {25336920-03F9-11CF-8FD0-00AA00686F13}
            = WWW HTML files
            PersistentHandler
                = {EEC97550-47A9-11CF-B952-00AA0051FE20}

Step 3: Determine the IFilter Persistent Handler GUID

Using <Value2> determined in Step 2, find the IFilter Persistent Handler GUID for the document type. The value under the key \HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\<Value2>\PersistentAddinsRegistered\{89BCB740-6119-101A-BCB7-00DD010655AF} yields the IFilter Persistent Handler GUID for this document type. Let this be <Value3>. {89BCB740-6119-101A-BCB7-00DD010655AF} is the IFilter interface GUID.

\Registry\Machine\Software\Classes\CLSID
        {EEC97550-47A9-11CF-B952-00AA0051FE20}
            = REG_SZ HTML File Persistent Handler
            PersistentAddinsRegistered
                {89BCB740-6119-101A-BCB7-00DD010655AF}
                    = REG_SZ {E0CA5340-4534-11CF-B952-00AA0051FE20}

Step 4: Determine the Filter DLL

Using <Value3> determined in Step 3, the filter DLL can be found under the entry \HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\<Value3>\InprocServer32.

\Registry\Machine\Software\Classes\CLSID
     {E0CA5340-4534-11CF-B952-00AA0051FE20}
        = REG_SZ HTML Filter
        InprocServer32
            = REG_SZ htmlfilt.dll

In this example, the filter DLL for HTML documents is Htmlfilt.dll.
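
The four-step lookup can be traced in code. The sketch below models the registry excerpts from the steps above as a nested Python dictionary (an assumption made so the example runs anywhere; it does not read the live registry). The helper names are hypothetical, and each key's default value is stored under the empty-string key.

```python
IFILTER_IID = "{89BCB740-6119-101A-BCB7-00DD010655AF}"

# In-memory stand-in for \HKEY_LOCAL_MACHINE\SOFTWARE\Classes, using the
# values shown in Steps 1-4. "" holds a key's default value.
CLASSES = {
    "htmlfile": {
        "": "Class for WWW HTML files",
        "CLSID": {"": "{25336920-03F9-11CF-8FD0-00AA00686F13}"},
    },
    "CLSID": {
        "{25336920-03F9-11CF-8FD0-00AA00686F13}": {
            "": "WWW HTML files",
            "PersistentHandler": {"": "{EEC97550-47A9-11CF-B952-00AA0051FE20}"},
        },
        "{EEC97550-47A9-11CF-B952-00AA0051FE20}": {
            "": "HTML File Persistent Handler",
            "PersistentAddinsRegistered": {
                IFILTER_IID: {"": "{E0CA5340-4534-11CF-B952-00AA0051FE20}"},
            },
        },
        "{E0CA5340-4534-11CF-B952-00AA0051FE20}": {
            "": "HTML Filter",
            "InprocServer32": {"": "htmlfilt.dll"},
        },
    },
}

def default_value(tree, *path):
    """Walk the subkeys in `path` and return the default value of the last one."""
    node = tree
    for key in path:
        node = node[key]
    return node[""]

def find_filter_dll(classes, doc_type):
    value1 = default_value(classes, doc_type, "CLSID")                    # Step 1
    value2 = default_value(classes, "CLSID", value1, "PersistentHandler")  # Step 2
    value3 = default_value(classes, "CLSID", value2,
                           "PersistentAddinsRegistered", IFILTER_IID)      # Step 3
    return default_value(classes, "CLSID", value3, "InprocServer32")       # Step 4

print(find_filter_dll(CLASSES, "htmlfile"))  # htmlfilt.dll
```

Note that real registry key names are case-insensitive, while the dictionary lookups in this sketch are not.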


Associating File Types with Extensions

File types are associated with file extensions under the \HKEY_LOCAL_MACHINE\SOFTWARE\Classes tree. The following are the associations for the htmlfile document type:

\HKEY_LOCAL_MACHINE\SOFTWARE\Classes
   .htm
        = REG_SZ htmlfile
    .html
        = REG_SZ htmlfile
    .htx
        = REG_SZ htmlfile
    .stm
        = REG_SZ htmlfile

By default, the extensions listed above are treated as htmlfile documents. To add another extension to this list, create a registry entry that associates the extension with the htmlfile type. For example, to treat .htx files as the htmlfile type, add the following entry:

\HKEY_LOCAL_MACHINE\SOFTWARE\Classes
   .htx
        = REG_SZ htmlfile

Adding Filter DLLs

To add new filter DLLs, please refer to the documentation provided with the filter DLLs.

Removing Filter DLLs

To remove a filter DLL, delete both the IFilter PersistentHandler entry associated with the document type and the filter DLL entry. Refer to the Filter DLLs section to see how to find the IFilter PersistentHandler for a particular document type.

For example, to remove the installed Htmlfilt.dll, the following two entries must be removed:

\Registry\Machine\Software\Classes\CLSID
     {EEC97550-47A9-11CF-B952-00AA0051FE20}
        PersistentAddinsRegistered
            {89BCB740-6119-101A-BCB7-00DD010655AF}
                = REG_SZ {E0CA5340-4534-11CF-B952-00AA0051FE20}                
\Registry\Machine\Software\Classes\CLSID
     {E0CA5340-4534-11CF-B952-00AA0051FE20}
        = REG_SZ HTML Filter
        InprocServer32
            = REG_SZ htmlfilt.dll

Binary Files - NULL Filter

When a registered binary file is encountered, the NULL filter is used. The NULL filter retrieves only the system properties; the contents of a binary file are not filtered. Examples of system properties are the file name, last write time, file size, and attributes.

A file with a certain extension is considered to be a binary file if its type in the registry is set to BinaryFile. For example, to associate the extension .lib with the binary file type, add the following entry to the registry:

\HKEY_LOCAL_MACHINE\Software\Classes
  \.lib
        = REG_SZ BinaryFile

The class BinaryFile is a predefined type that uses the NULL filter for its IFilter implementation.


Warning   If the extension for which you wish to use the NULL filter already has a file type, do not change it to BinaryFile. Doing so could damage your Windows NT installation. Instead, use the following procedure to set the implementation of the IFilter interface for the file type.


When a file extension already has a file type, use the previous procedure to look up the PersistentAddinsRegistered key and set the IFilter interface implementation. The example below is for files with the extension .dll.

Step 1: Determine the file type

Find the file type associated with the file extension .dll.

\HKEY_LOCAL_MACHINE\Software\Classes
        \.dll
            = REG_SZ dllfile

Step 2: Determine the CLSID

Look up the CLSID associated with the dllfile type in the registry.

\HKEY_LOCAL_MACHINE\Software\Classes
        dllfile
            = REG_SZ Application Extension
            CLSID
                = REG_SZ {3cf51a00-84eb-11ce-ac07-00004c752752}

Step 3: Determine the Persistent Handler

Look up the persistent handler GUID for the CLSID in the registry. If there is no persistent handler, set it to the CLSID for the persistent handler of the NULL filter, “{098F2470-BAE0-11CD-B579-08002B30BFEB}”. Otherwise, continue with the next step.

\HKEY_LOCAL_MACHINE\Software\Classes
        CLSID
            {3cf51a00-84eb-11ce-ac07-00004c752752}
                PersistentHandler
                    = REG_SZ {098F2470-BAE0-11CD-B579-08002B30BFEB}

Step 4: Set the IFilter Persistent Handler

Look up the CLSID found in the step above and set the IFilter handler {89BCB740-6119-101A-BCB7-00DD010655AF} to the NULL filter GUID {C3278E90-BEA7-11CD-B579-08002B30BFEB}.

\HKEY_LOCAL_MACHINE\Software\Classes
        CLSID
            {098F2470-BAE0-11CD-B579-08002B30BFEB}
                 PersistentAddinsRegistered
                     {89BCB740-6119-101A-BCB7-00DD010655AF}
                          = REG_SZ {C3278E90-BEA7-11CD-B579-08002B30BFEB}
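
The same procedure can be expressed as a short routine. The sketch below again works on a nested dictionary that stands in for the \HKEY_LOCAL_MACHINE\Software\Classes tree (an assumption for illustration; it does not touch the real registry), with each key's default value stored under the empty-string key. The function name is hypothetical.

```python
IFILTER_IID = "{89BCB740-6119-101A-BCB7-00DD010655AF}"
NULL_HANDLER = "{098F2470-BAE0-11CD-B579-08002B30BFEB}"  # NULL filter persistent handler
NULL_FILTER = "{C3278E90-BEA7-11CD-B579-08002B30BFEB}"   # NULL filter GUID

def use_null_filter(classes, extension):
    """Apply Steps 1-4 to the in-memory `classes` tree; return the handler used."""
    file_type = classes[extension][""]            # Step 1: file type
    clsid = classes[file_type]["CLSID"][""]       # Step 2: CLSID
    clsid_key = classes["CLSID"].setdefault(clsid, {})
    # Step 3: keep an existing persistent handler, otherwise use the NULL one.
    handler = clsid_key.setdefault("PersistentHandler", {}).setdefault("", NULL_HANDLER)
    # Step 4: point the handler's IFilter entry at the NULL filter.
    handler_key = classes["CLSID"].setdefault(handler, {})
    addins = handler_key.setdefault("PersistentAddinsRegistered", {})
    addins.setdefault(IFILTER_IID, {})[""] = NULL_FILTER
    return handler

classes = {
    ".dll": {"": "dllfile"},
    "dllfile": {
        "": "Application Extension",
        "CLSID": {"": "{3cf51a00-84eb-11ce-ac07-00004c752752}"},
    },
    "CLSID": {"{3cf51a00-84eb-11ce-ac07-00004c752752}": {}},
}
print(use_null_filter(classes, ".dll"))
```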

Here is a list of default extensions for binary files:

.aif,.avi,.cgm,.com,.dct,.dic,.dll,.exe,.eyb,.fnt,.ghi,.gif,
.hqx,.ico,.inv,.jbf,.jpg,.m14,.mov,.movie,.mv,
.pdf,.pic,.pma,.pmc,.pml,.pmr,.psd,.sc2,
.tar,.tif,.tiff,.ttf,.wav,.wll,.wlt,.wmf,.z,.z96,.zip

Default Filter

In Index Server, a default filter filters both the system properties (such as file name) and the contents of a file. The default filter does not “understand” any document formats; when filtering the contents of a file, it treats the file as a sequence of characters. Index Server uses the default filter when the extension of a file has no association in the registry and the value of the registry setting FilterFilesWithUnknownExtensions is 1.

Note   The default filter filters plain text and files of unknown origin. It assumes all text to be in the default codepage of the server.

Corrupted Files

If a file is corrupted, the filter DLL may not be able to properly interpret the contents of that file. To get a list of files that could not be filtered, see Unfiltered Files. An event is also written to the event log. Sometimes a file cannot be filtered because of a defective third-party filter DLL. After verifying the contents of a file, an administrator should report the problems to the filter DLL vendor. Files protected by passwords are not filtered.

Maximum Retries

If a document cannot be filtered, filtering is retried up to a maximum number of times. If the document still cannot be filtered, it is considered an unfiltered file. The registry key FilterRetries controls the maximum number of retries for a document.

To get a list of all the files that could not be filtered, issue the query @unfiltered = TRUE.
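
The retry behavior can be illustrated with a small loop. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and it assumes one initial attempt followed by up to FilterRetries further attempts, which the text above does not spell out.

```python
def filter_with_retries(path, try_filter, filter_retries):
    """Return True if `path` was filtered, False if it ends up unfiltered.

    `try_filter` is a caller-supplied callable that returns True on success.
    Assumption: one initial attempt plus up to `filter_retries` retries.
    """
    for _ in range(1 + filter_retries):
        if try_filter(path):
            return True
    return False

# Example: a stand-in filter that always fails on one corrupted file.
def always_fails_on_bad(path):
    return path != "bad.doc"

files = ["good.doc", "bad.doc"]
unfiltered = [f for f in files if not filter_with_retries(f, always_fails_on_bad, 4)]
print(unfiltered)  # ['bad.doc']
```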

Unknown Extensions

A file with an extension that does not have an association in the registry is treated as having an unknown extension. The behavior of Index Server depends upon the registry setting FilterFilesWithUnknownExtensions. If this value is set to 0, the NULL filter is used for those files. Otherwise, the default filter is used to filter their contents.
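
The choice between the NULL filter, the default filter, and a registered filter can be summarized as a decision function. This is a simplified sketch: the function name and the `associations` mapping (extension to file type) are assumptions for illustration, and it ignores cases such as a registered file type that has no filter of its own.

```python
def choose_filter(extension, associations, filter_unknown_extensions):
    """Return a description of the filter that would handle `extension`.

    `associations` maps lowercase extensions to file types, standing in for
    the registry. `filter_unknown_extensions` is the registry setting value.
    """
    file_type = associations.get(extension.lower())
    if file_type is None:
        # Unknown extension: the registry setting decides.
        return "default filter" if filter_unknown_extensions != 0 else "NULL filter"
    if file_type == "BinaryFile":
        return "NULL filter"
    return f"registered filter for {file_type}"

associations = {".htm": "htmlfile", ".lib": "BinaryFile"}
print(choose_filter(".htm", associations, 1))  # registered filter for htmlfile
print(choose_filter(".lib", associations, 1))  # NULL filter
print(choose_filter(".xyz", associations, 0))  # NULL filter
print(choose_filter(".xyz", associations, 1))  # default filter
```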

Filtering Directories

By default, directories are not filtered and will not appear in query results. To filter directories, set the registry key FilterDirectories to 1. When directories are filtered, their system properties are filtered.

Characterization

The CiDaemon process can automatically generate summaries, called characterizations (or abstracts), for documents. If the registry key GenerateCharacterization is set to 1, the characterization is generated automatically. The maximum number of characters in the generated characterization is controlled by the registry key MaxCharacterization.

Preinstalled Filter DLLs

The list of document types for which filter DLLs are preinstalled is given below:


Word-Breaker DLLs

A word-breaker DLL parses the text and textual properties returned by the filter DLL into words. The word-breaker DLL is language dependent. The following languages are supported by Microsoft Index Server:


Noise Words

Words that are not significant for searching are called noise words or stop words. Noise words are stored in the %systemroot%\system32 directory in various noise word files (Noise.enu, by default). The noise word files are language dependent. The noise word file for a particular language is specified in the registry under the key:

HKEY_LOCAL_MACHINE
\SYSTEM
 \CurrentControlSet
  \Control
   \ContentIndex
    \Language
     \<language>
      \NoiseFile

For example, the noise word file for English_US is specified under the registry key:

HKEY_LOCAL_MACHINE
\SYSTEM
 \CurrentControlSet
  \Control
   \ContentIndex
    \Language
     \English_US
      \NoiseFile
       \noise.enu

The noise word files can be edited with a text editor to either add new words or remove words that are not considered “noise” at a particular installation. Note that querying for noise words will not yield any hits.

Removing all noise words from the noise word files can significantly increase the size of indexes.
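
The effect of noise words on queries and index size can be shown with a toy index. This is a sketch only: it assumes a noise word file is whitespace-separated text, uses a simple dictionary in place of the content index, and all function names are hypothetical.

```python
from collections import defaultdict

def parse_noise_words(file_text):
    """Assumption: the noise word file lists words separated by whitespace."""
    return {w.lower() for w in file_text.split()}

def build_index(documents, noise_words):
    """Map each non-noise word to the set of documents that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in text.lower().split():
            if word not in noise_words:
                index[word].add(name)
    return index

docs = {"a.txt": "the cat and the hat", "b.txt": "the dog"}
noise = parse_noise_words("and\nthe\n")

index = build_index(docs, noise)
print(sorted(index))                  # ['cat', 'dog', 'hat']
print(index.get("the", set()))        # set() : a noise word yields no hits
print(len(build_index(docs, set())))  # 5 : with no noise words the index is larger
```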


CiDaemon Priority Settings

The CiDaemon priority is controlled by two settings:

ThreadClassFilter specifies the priority class of the filter daemon. The possible values are:

NORMAL_PRIORITY_CLASS                     0x00000020
IDLE_PRIORITY_CLASS (default)             0x00000040
HIGH_PRIORITY_CLASS                       0x00000080
REALTIME_PRIORITY_CLASS                   0x00000100

ThreadPriorityFilter specifies the priority within the specified class. The possible values are:

THREAD_PRIORITY_LOWEST                    -2
THREAD_PRIORITY_BELOW_NORMAL              -1
THREAD_PRIORITY_NORMAL                     0
THREAD_PRIORITY_ABOVE_NORMAL (default)    +1
THREAD_PRIORITY_HIGHEST                   +2

By default, the CiDaemon process runs in the idle priority class to prevent interference with normal foreground activity. On a busy server, this might result in files never being filtered. To run the CiDaemon process as a normal process, set ThreadClassFilter to NORMAL_PRIORITY_CLASS and ThreadPriorityFilter to THREAD_PRIORITY_NORMAL. Setting ThreadClassFilter to HIGH_PRIORITY_CLASS or REALTIME_PRIORITY_CLASS is not recommended because it may interfere with normal activity on the system.
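
The two settings and the recommendation above can be captured in a small validation helper. The numeric values come from the lists in this section; the helper itself, including its name and return shape, is a hypothetical sketch and not part of Index Server.

```python
PRIORITY_CLASSES = {
    "NORMAL_PRIORITY_CLASS": 0x00000020,
    "IDLE_PRIORITY_CLASS": 0x00000040,      # default
    "HIGH_PRIORITY_CLASS": 0x00000080,
    "REALTIME_PRIORITY_CLASS": 0x00000100,
}
THREAD_PRIORITIES = {
    "THREAD_PRIORITY_LOWEST": -2,
    "THREAD_PRIORITY_BELOW_NORMAL": -1,
    "THREAD_PRIORITY_NORMAL": 0,
    "THREAD_PRIORITY_ABOVE_NORMAL": 1,      # default
    "THREAD_PRIORITY_HIGHEST": 2,
}
NOT_RECOMMENDED = {"HIGH_PRIORITY_CLASS", "REALTIME_PRIORITY_CLASS"}

def name_for(table, value):
    """Return the symbolic name for `value`, or raise ValueError if unknown."""
    for name, number in table.items():
        if number == value:
            return name
    raise ValueError(f"unsupported value: {value!r}")

def check_settings(thread_class_filter, thread_priority_filter):
    """Return (class name, priority name, recommended?) for the two settings."""
    class_name = name_for(PRIORITY_CLASSES, thread_class_filter)
    priority_name = name_for(THREAD_PRIORITIES, thread_priority_filter)
    return class_name, priority_name, class_name not in NOT_RECOMMENDED

print(check_settings(0x20, 0))   # run CiDaemon as a normal process
print(check_settings(0x100, 2))  # valid values, but not recommended
```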


Related Performance Counters

The following counters are present under the performance monitor object Content Index.

Counter Name            Explanation
# documents filtered    The number of documents filtered since the indexing was started in the current process instantiation. Note that this does not include the documents filtered in prior runs of Index Server.
Files to be filtered    These are the files remaining to be filtered.
Total # of documents    Total number of documents known to the index.

The following counters are present under the perfmon object Content Index Filter.

Counter Name            Explanation
Binding Time            Average time (in milliseconds) to bind to a filter DLL.
Filter Speed            Speed (in megabytes per hour) at which documents are filtered.
Total Filter Speed      Speed (in megabytes per hour) at which documents are indexed. This includes both the time to filter document contents, plus time to filter properties and generate abstracts.

Disk Full Condition

If the free disk space on the index disk starts running low (less than 3 MB), filtering will be temporarily paused. A disk-full event will be written to the event log. The administrator should free up disk space by deleting or moving files from that drive.
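
An administrator could watch for the same condition with a small script. The sketch below is an assumption-laden illustration: it treats 3 MB as 3 * 2**20 bytes, the function names are hypothetical, and it only reports free space; it does not interact with Index Server.

```python
import shutil

# "Less than 3 MB" from the text above; assumes 1 MB = 2**20 bytes.
LOW_DISK_THRESHOLD = 3 * 1024 * 1024

def should_pause_filtering(free_bytes):
    """True when free space is below the threshold at which filtering pauses."""
    return free_bytes < LOW_DISK_THRESHOLD

def index_disk_is_low(index_path):
    """Check the drive that holds `index_path` using the standard library."""
    return should_pause_filtering(shutil.disk_usage(index_path).free)

print(should_pause_filtering(2 * 1024 * 1024))   # True
print(should_pause_filtering(50 * 1024 * 1024))  # False
```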




© 1996 by Microsoft Corporation. All rights reserved.